Analysis of 2016 NC Presidential Campaign Contributions by Matthew Bellissimo


I wanted to analyze the 2016 presidential campaign contributions made in North Carolina.

Univariate Plots Section

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"
## 'data.frame':    2319 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 14 levels "C00458844","C00500587",..: 6 6 6 6 6 5 5 5 5 5 ...
##  $ cand_id          : Factor w/ 14 levels "P00003392","P20002721",..: 1 1 1 1 1 4 4 4 4 4 ...
##  $ cand_nm          : Factor w/ 15 levels "Bush, Jeb","Carson, Benjamin S.",..: 3 3 3 3 3 11 11 11 11 11 ...
##  $ contbr_nm        : Factor w/ 1036 levels "ACQUAVIVA, TONY",..: 1014 604 779 495 474 257 509 444 480 149 ...
##  $ contbr_city      : Factor w/ 227 levels "ABERDEEN","ADVANCE",..: 52 209 115 166 224 137 150 8 58 36 ...
##  $ contbr_st        : Factor w/ 1 level "NC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  277132233 273707743 287487012 273125862 271045057 281108912 278569198 288042811 286213012 282266472 ...
##  $ contbr_employer  : Factor w/ 400 levels "","3RC","5 STAR AWARDS",..: 238 296 238 238 238 284 177 177 218 294 ...
##  $ contbr_occupation: Factor w/ 322 levels "","1ST GRADE TEACHER ASSISTANT",..: 183 295 261 261 261 261 132 132 135 193 ...
##  $ contb_receipt_amt: num  100 75 112 100 500 50 250 2700 250 500 ...
##  $ contb_receipt_dt : Factor w/ 126 levels "1-Apr-15","1-Jun-15",..: 108 14 14 86 13 3 117 23 117 39 ...
##  $ receipt_desc     : Factor w/ 12 levels "","REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_text        : Factor w/ 19 levels "","* EARMARKED CONTRIBUTION: SEE BELOW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ file_num         : int  1015585 1015585 1015585 1015585 1015585 1015683 1015683 1015683 1015683 1015683 ...
##  $ tran_id          : Factor w/ 2315 levels "A020838FE8B6E4ABD8E6",..: 179 338 339 428 561 118 150 29 83 176 ...
##  $ election_tp      : Factor w/ 3 levels "","G2016","P2016": 3 3 3 3 3 3 3 2 3 3 ...

Candidate Names

##  [1] "Bush, Jeb"                 "Carson, Benjamin S."      
##  [3] "Clinton, Hillary Rodham"   "Cruz, Rafael Edward 'Ted'"
##  [5] "CRUZ, RAFAEL EDWARD TED"   "Fiorina, Carly"           
##  [7] "Graham, Lindsey O."        "Huckabee, Mike"           
##  [9] "Jindal, Bobby"             "O'Malley, Martin Joseph"  
## [11] "Paul, Rand"                "Perry, James R. (Rick)"   
## [13] "Rubio, Marco"              "Sanders, Bernard"         
## [15] "Santorum, Richard J."

Count of Unique Contributors

## [1] 1036

Count of Unique Occupations

## [1] 322

Top 10, Cities and Contributor Occupations

##     CHARLOTTE       RALEIGH    GREENSBORO   CHAPEL HILL WINSTON SALEM 
##           287           200            98            90            82 
##     ASHEVILLE        DURHAM          CARY    WILMINGTON        LELAND 
##            80            80            74            50            48
##                                RETIRED 
##                                    732 
##                           NOT EMPLOYED 
##                                    126 
## INFORMATION REQUESTED PER BEST EFFORTS 
##                                    107 
##                              HOMEMAKER 
##                                    101 
##                               ATTORNEY 
##                                     75 
##                  INFORMATION REQUESTED 
##                                     45 
##                              PHYSICIAN 
##                                     42 
##                      ACCOUNT ASSISTANT 
##                                     37 
##   BUILDING ASSISTANT MANAGER/CUSTODIAN 
##                                     35 
##                             CONSULTANT 
##                                     28

Summary of Receipt Amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    40.0   100.0   375.9   250.0 10800.0

Receipt Descriptions

##  [1] ""                                                                             
##  [2] "REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC)"                          
##  [3] "REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC) REATTRIBUTION FROM SPOUSE"
##  [4] "REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC) REATTRIBUTION TO SPOUSE"  
##  [5] "REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC) SEE REATTRIBUTION"        
##  [6] "REATTRIBUTION FROM SPOUSE"                                                    
##  [7] "REATTRIBUTION TO SPOUSE"                                                      
##  [8] "REATTRIBUTION/REDESIGNATION REQUESTED"                                        
##  [9] "REDESIGNATION FROM PRIMARY"                                                   
## [10] "REDESIGNATION TO GENERAL"                                                     
## [11] "Refund"                                                                       
## [12] "SEE REATTRIBUTION"

I noticed that Ted Cruz was being counted twice, under two differently formatted names. I used tidyr to fix this.

I also added another column called ‘Days’ with the date represented as a numeric value.
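The date conversion can be sketched as follows. This is a minimal illustration, not the original code: it assumes the dates look like the `contb_receipt_dt` levels above (for example "1-Apr-15"), and the helper name `parse_receipt_date` and the toy data frame are hypothetical. The month lookup uses R's built-in `month.abb` so the result does not depend on the system locale.

```r
# Hypothetical helper: parse dates such as "1-Apr-15" without relying on the locale.
parse_receipt_date <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  month <- match(vapply(parts, `[`, character(1), 2), month.abb)
  year  <- 2000L + as.integer(vapply(parts, `[`, character(1), 3))  # assumes 20xx years
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

# Toy data standing in for the real contributions table.
toy <- data.frame(
  contb_receipt_dt = factor(c("1-Apr-15", "15-Jun-15", "30-Jun-15"))
)

dates    <- parse_receipt_date(toy$contb_receipt_dt)
toy$Days <- as.numeric(dates - min(dates))  # days since the earliest receipt
toy$Days
```

Subtracting the minimum date makes the earliest contribution day 0, which gives a plain numeric axis for later plots.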

## 'data.frame':    2319 obs. of  19 variables:
##  $ cmte_id          : Factor w/ 14 levels "C00458844","C00500587",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ cand_id          : Factor w/ 14 levels "P00003392","P20002721",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ contbr_nm        : Factor w/ 1036 levels "ACQUAVIVA, TONY",..: 903 147 155 314 560 637 741 879 113 186 ...
##  $ contbr_city      : Factor w/ 227 levels "ABERDEEN","ADVANCE",..: 36 168 224 222 67 35 88 214 48 200 ...
##  $ contbr_st        : Factor w/ 1 level "NC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : int  282103248 276146201 27106 284054795 287340027 275178398 272657650 281736806 280367104 273589364 ...
##  $ contbr_employer  : Factor w/ 400 levels "","3RC","5 STAR AWARDS",..: 245 203 285 148 284 284 296 34 317 395 ...
##  $ contbr_occupation: Factor w/ 322 levels "","1ST GRADE TEACHER ASSISTANT",..: 181 318 101 145 261 261 15 179 100 86 ...
##  $ contb_receipt_dt : Factor w/ 126 levels "1-Apr-15","1-Jun-15",..: 25 99 44 90 29 29 77 90 99 99 ...
##  $ receipt_desc     : Factor w/ 12 levels "","REATTRIBUTION / REDESIGNATION REQUESTED (AUTOMATIC)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_text        : Factor w/ 19 levels "","* EARMARKED CONTRIBUTION: SEE BELOW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ file_num         : int  1015075 1015075 1015075 1015075 1015075 1015075 1015075 1015075 1015075 1015075 ...
##  $ tran_id          : Factor w/ 2315 levels "A020838FE8B6E4ABD8E6",..: 638 696 647 685 640 641 660 678 702 704 ...
##  $ election_tp      : Factor w/ 3 levels "","G2016","P2016": 3 3 3 3 3 3 3 3 3 3 ...
##  $ cand_nm          : Factor w/ 14 levels "Bush, Jeb","Carson, Benjamin S.",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ contb_receipt_amt: num  250 1350 2700 2000 250 1000 2700 2500 250 2700 ...
##  $ Days             : num  286 301 290 300 287 287 297 300 301 301 ...

When I looked at a summary of contribution amounts, I noticed that a number of values were negative. Looking at the receipt descriptions showed that these values actually represented refunds.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      50     100     397     250   10800

Excluding refunds from the dataset didn’t show a significant change in the summary values, so I left them in.
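The comparison with and without refunds can be sketched like this. The amounts are made-up toy values, and treating every negative amount as a refund is an assumption based on the receipt descriptions above.

```r
# Toy amounts; the negative value stands in for a refund.
amt <- c(-5400, 2, 40, 100, 100, 250, 2700, 10800)

summary(amt)           # all records, refunds included
summary(amt[amt > 0])  # refunds excluded

# How many records are refunds, and how far does dropping them move the median?
n_refunds    <- sum(amt < 0)
median_shift <- median(amt[amt > 0]) - median(amt)
```

Because the median and quartiles are robust to a handful of extreme values, removing a few refunds mostly affects the minimum and the mean.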

I decided to make two summary tables so that I could go over the data at a higher level. One was for candidate information, the other for donor information.

## [1] 568   6
## [1] 468   6
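A per-candidate summary table of this kind could be built in base R roughly as follows. This is only a sketch on toy data: the candidate labels "A" and "B" and the output column names are hypothetical, and the original tables may have been built differently.

```r
# Toy contributions for two hypothetical candidates.
toy <- data.frame(
  cand_nm           = c("A", "A", "A", "B", "B"),
  contb_receipt_amt = c(100, 250, 2700, 50, 50)
)

# Split the amounts by candidate, then compute one statistic per group.
by_cand <- split(toy$contb_receipt_amt, toy$cand_nm)

cand_summary <- data.frame(
  cand_nm = names(by_cand),
  n       = vapply(by_cand, length, integer(1)),
  total   = vapply(by_cand, sum, numeric(1)),
  mean    = vapply(by_cand, mean, numeric(1)),
  median  = vapply(by_cand, median, numeric(1)),
  max     = vapply(by_cand, max, numeric(1)),
  row.names = NULL
)
cand_summary
```

The same pattern works for a donor table by splitting on `contbr_nm` instead of `cand_nm`.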

There are 14 candidates and 1036 unique contributors; 436 contributors contributed more than once. The maximum contribution was $10800, and the largest refund was $5400 (recorded as -5400). The smallest contribution was $2. The interquartile range of contributions ran from $40 to $250, the median contribution was $100, and the mean contribution was $375.90. Contributor occupation varied widely, though the largest group was composed of retirees.

The largest number of contributions went to Ben Carson (594) on the Republican side, followed by Ted Cruz (425), and then Democratic candidate Hillary Clinton (414). Rick Perry came in last with only one contribution. The largest contribution ($10800) went to Ben Carson. Jeb Bush had the highest mean and median contributions, $2034 and $2700 respectively.

To begin, I plotted a basic histogram.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

I could see that most contributions were clustered slightly above 0, with small numbers of larger positive and negative contributions. Unfortunately, it was difficult to infer anything beyond this, so I decided to play around with the binwidth and add some more detail to the x axis.

The new plot showed that most contributions were clustered between 0 and 1000, with a spike of contributions occurring just below 3000 and a small number of much larger contributions above that. It also appeared that most refunds occurred around this 3000 dollar level. Still, it was difficult to see the outliers, as there were very few of them. I didn’t want to ignore these values, however, because even though there are not many of them, I knew that large contributions are very important in political elections. To remedy this, I decided to perform a log10 transformation on the x axis.

## Warning in scale$trans$trans(x): NaNs produced

This showed the clearest picture yet of how contributions occurred. In addition to looking roughly normally distributed on the log scale, the plot now showed, in much greater detail, the scale and frequency of different contributions. The only downside was that the log transformation is undefined for negative values, so they became NaN and were dropped (hence the warning above), which meant that I lost all of the information on refunds.
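The loss of refunds under a log scale is easy to reproduce on a few toy amounts (the values here are illustrative only):

```r
# Toy amounts: one refund, one zero, and three ordinary contributions.
amt <- c(-5400, 0, 2, 100, 10800)

# log10 of a negative number is NaN (with a warning); log10(0) is -Inf.
logged <- suppressWarnings(log10(amt))
logged

is.nan(logged)       # TRUE only for the negative amount
is.infinite(logged)  # TRUE only for the zero amount
```

Any value that maps to NaN or -Inf cannot be placed on a log axis, which is why those rows disappear from the plot.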

I also noticed that there were major gaps between the contribution values, which I thought was a really interesting trend. Contribution values mostly seemed to be whole, round numbers, and almost all were multiples of 25. I believe this is an example of the psychological tendency of mental anchoring, as contributors tend to give in familiar amounts (for example, 1000 is a more common contribution than 980). In any case, the result was that contributions looked like they mostly occurred in discrete increments.

Because of this phenomenon, and because I really did want to include refunds in my plot, I decided it would make sense to use R’s cut function.
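One way to put a number on the "multiples of 25" impression is to compute the share of amounts that divide evenly by 25. The vector below is toy data, not the real contributions, and amounts with cents would need rounding before using `%%`.

```r
# Toy contribution amounts (whole dollars).
amt <- c(2, 25, 50, 100, 250, 980, 1000, 2700)

# TRUE where the amount is an exact multiple of 25.
is_multiple <- amt %% 25 == 0

# Share of contributions that are multiples of 25.
mean(is_multiple)
```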

## (-6000,-3000]     (-3000,0]         (0,5]        (5,10]       (10,25] 
##             1            21            70           110           303 
##       (25,50]      (50,100]     (100,250]     (250,500]    (500,1000] 
##           399           511           431           188            86 
##   (1000,2750]   (2750,5500]  (5500,11000] 
##           177            18             4
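The bucketing can be sketched with `cut` as follows. The break points are read off the table above; the toy amounts and the `dig.lab = 5` setting (used here so that labels print as plain numbers rather than in scientific notation) are assumptions, since the original call is not shown.

```r
# Break points matching the bucket labels in the table above.
breaks <- c(-6000, -3000, 0, 5, 10, 25, 50, 100, 250, 500,
            1000, 2750, 5500, 11000)

# Toy amounts, including two refunds.
amt <- c(-5400, -100, 2, 100, 2700, 10800)

# Intervals are closed on the right by default, so 100 falls in (50,100].
bucket <- cut(amt, breaks = breaks, dig.lab = 5)
table(bucket)
```

Because the breaks extend below zero, refunds get their own buckets instead of being dropped as they were under the log transformation.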

I figured it would also make sense to create a plot to show how the candidates ranked in terms of the number of donations received. We can see that only 3 of the 14 candidates (Ben Carson, Ted Cruz, and Hillary Clinton) broke 400 contributions since the start of their campaigns.

Univariate Analysis

What is the structure of your dataset?

In this dataset there were 2319 contributions, made to 14 different candidates by 1036 unique contributors. 436 contributors gave money more than once. The largest amount contributed was $10800, given to Ben Carson. The biggest refund was for $5400, refunded by Ted Cruz. The smallest contribution was $2, and the interquartile range of donations excluding refunds was between $50 and $250. Again excluding refunds, the median contribution was $100 and the mean contribution was $397. The occupation of contributors varied widely, with 322 unique values listed. Republican Ben Carson received the largest number of contributions (594), followed by fellow Republican Ted Cruz (425) and then Democratic candidate Hillary Clinton (414). Rick Perry came in last, with only one contribution made to his campaign. Republican Jeb Bush had the highest mean and median contributions, $2034 and $2700 respectively.

There were 18 original features in the dataset. These were:

[cmte_id, cand_id, cand_nm, contbr_nm, contbr_city, contbr_st, contbr_zip, contbr_employer, contbr_occupation, contb_receipt_amt, contb_receipt_dt, receipt_desc, memo_cd, memo_text, form_tp, file_num, tran_id, election_tp]

What is/are the main feature(s) of interest in your dataset?

The main features of interest in my dataset are cand_nm and contb_receipt_amt. Most of the things I am interested in investigating are related to the success of the candidate. The frequency and scale of contributions, I believe, play a large role in this.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that might make sense to look at are contbr_nm, contbr_city, and contb_receipt_dt. It would be interesting to see if something like the 80-20 rule applies here, with 80% of contributions being given by 20% of contributors. I believe it would also be helpful to see where different contributions were coming from, to see if certain areas were more valuable in terms of campaigning. Finally, it would make sense to look at when different donations occurred, to see if certain candidates have been trending up or down as their campaigns have progressed.

Did you create any new variables from existing variables in the dataset?

I created a variable called contb_receipt_amt.Bucket from contb_receipt_amt using R’s cut function. The purpose of this was to get around the fact that most donations occurred in seemingly discrete increments. I also added another column called ‘Days’ with the date represented as a numeric value (I subtracted the minimum date from all dates in the vector). I did this so that it would be easier to plot with this information later on.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When the data was first loaded, Ted Cruz had two cand_nm levels devoted to him, which was a mistake. To fix this I used tidyr’s spread function to turn the candidate names into columns, then added the values of the extra Cruz column to the values of the first one. I then gathered the updated columns back together and deleted the extra one.
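For comparison, the same duplicate-level problem can also be handled without reshaping. The sketch below is not the spread/gather approach described above; it is a simpler base R alternative that recodes the stray level directly, shown on a toy factor.

```r
# Toy factor containing both spellings of the same candidate.
cand_nm <- factor(c("Cruz, Rafael Edward 'Ted'",
                    "CRUZ, RAFAEL EDWARD TED",
                    "Bush, Jeb"))

# Recode the all-caps variant to the canonical spelling, then rebuild the factor
# so the unused level is dropped.
fixed <- as.character(cand_nm)
fixed[fixed == "CRUZ, RAFAEL EDWARD TED"] <- "Cruz, Rafael Edward 'Ted'"
fixed <- factor(fixed)

nlevels(cand_nm)  # levels before the fix
nlevels(fixed)    # levels after the fix
```

Recoding keeps every row in place, so no other columns need to be spread and gathered around the fix.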

Bivariate Plots Section

I decided to visualize some of the summary table information. The first question I wanted to answer was “Who received the most money?”

It turns out that while Ben Carson was still in the lead for Republicans, Hillary Clinton was actually the candidate who had raised the most money so far.

Let’s see how candidates ranked in terms of mean donations.

Looks like a close contest between Jeb Bush, Mike Huckabee, Lindsey Graham, and Martin O’Malley. That’s really strange. O’Malley has been polling at close to 0 since the start of the race. I wonder what the plot would look like if we used median contributions instead.

Looks like the same 4 candidates as before, except now Jeb Bush is blowing everybody out of the water. Also interesting is that Ben Carson, who led the pack in terms of the number of donations and maximum donations, is actually now closer to the tail of the plot. Let’s see how maximum donations played into the mix.

OK, that’s really interesting. Except for Mike Huckabee, the top 4 leaders in mean and median aren’t the leaders in the largest donations. Ben Carson actually leads here, which is weird because he was ranked in the middle for mean donations, and large donations should have skewed that value upwards for him. Maybe it was just a fluke. In any case, there was only one way to be sure. Time for a boxplot.

So now this makes more sense. It looks like most of Jeb Bush’s donations were above the $2500 mark, while Ben Carson’s donations were mostly much smaller. Carson did get the largest donation, and he does have a few outliers, but it looks like the great majority of his support has come from small donations. We can also now see that Martin O’Malley hasn’t had a lot of donations, and the ones he’s had were mostly small. However, because he has so few donations, the larger ones he did get probably skewed his mean upwards, since a mean over so few values is highly variable.

Here’s something similar using geom_point

Let’s quickly look at where, in terms of location, this money is coming from. Looks like most of the money is coming from Charlotte and Raleigh.

We talked about the 80-20 rule; let’s make another plot to see if that’s actually playing out here.

Wow. Whoever Mr. Harrison J Frank III is, I’d like to meet him, since I guess he had the disposable income to more than double the contribution of the next biggest contributor. In terms of the 80-20 rule, the top 20% of contributors (the top 207) made up around 67.4% of total contributions, while the top 3% of contributors (30) made up just over 20% of total contributions. So this is a BIG skew.

Sum of Total Contributions

## [1] 871792.3

Sum of Top 20%(207) Contributors

## [1] 587372.3

Percentage of Total Contributions made up by the top 20% of Contributors

## [1] 0.6737525

Sum of Top 30 Contributors

## [1] 177679.5

Percentage of Total Contributions made up by the top 3% (30) of Contributors

## [1] 0.2038094
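The share calculations above follow a simple recipe: sort contributor totals from largest to smallest, take the top fraction, and divide by the grand total. A minimal sketch, using a hypothetical helper `top_share` and toy totals rather than the real data:

```r
# Hypothetical helper: share of the grand total contributed by the top `frac`
# of contributors (the count is rounded down, e.g. 20% of 1036 gives 207).
top_share <- function(totals, frac) {
  sorted <- sort(totals, decreasing = TRUE)
  k <- floor(length(sorted) * frac)
  sum(sorted[seq_len(k)]) / sum(sorted)
}

# Toy per-contributor totals for ten donors.
totals <- c(5000, 2000, 1000, 500, 500, 400, 300, 200, 50, 50)

top_share(totals, 0.20)  # share given by the top 2 of 10 donors
```

As a check against the figures above, 587372.3 / 871792.3 is about 0.674, matching the reported 67.4%.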

The last thing I wanted to look at in this section was how the frequency of contributions compared to total contributions. I could see that the highest concentration of donors were one-offs, and that the range of their donations varied widely. I decided that rather than attempt a subjective interpretation, it might make sense to just perform a linear regression.

## 
## Call:
## lm(formula = newlog(TotalContributions) ~ FreqOfContributions, 
##     data = Contributions.Summary.Con)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0238 -0.5165 -0.3059  0.8698  4.2222 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.05204    0.04875 124.138   <2e-16 ***
## FreqOfContributions -0.01412    0.01404  -1.006    0.315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.199 on 1034 degrees of freedom
## Multiple R-squared:  0.000977,   Adjusted R-squared:  1.086e-05 
## F-statistic: 1.011 on 1 and 1034 DF,  p-value: 0.3148

The coefficient for FreqOfContributions was very close to zero, and the p-value was very high (0.315). This meant that we could not reject the null hypothesis that the coefficient on FreqOfContributions is zero, so there did not appear to be a linear relationship between these two values.
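Reading the slope and its p-value off a fitted model can be sketched as below. This uses simulated toy data, and because the `newlog` helper in the output above is not shown, plain `log` on strictly positive totals is used here as a stand-in assumption.

```r
set.seed(1)

# Simulated donors: totals are generated independently of frequency,
# so the true slope is zero by construction.
freq  <- rep(1:5, each = 20)
total <- exp(rnorm(length(freq), mean = 6, sd = 1))

fit   <- lm(log(total) ~ freq)
coefs <- summary(fit)$coefficients

# Estimated slope and the p-value for the test that the slope is zero.
slope   <- coefs["freq", "Estimate"]
slope_p <- coefs["freq", "Pr(>|t|)"]
```

A large `slope_p` means the data are consistent with a zero slope; it does not prove that no relationship of any kind exists.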

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The strongest relationship I saw was between mean donation and median donation. Jeb Bush was in the lead on both, despite not having the most donations or the largest donations. It looks like at this point in the campaign season it may be too early for money to matter. An alternative explanation is that most of the contributions that matter are not listed in the Federal Election Commission data, as they have instead gone to super PACs. I can say that after playing around with the mean, median, max, frequency, and total sum of donations, the same 4 candidates kept popping up in different orders: Hillary Clinton, Ben Carson, Ted Cruz, and Jeb Bush. Three of these four ranked highly in terms of the frequency of donations; Jeb Bush was the only one who didn’t. This was probably offset by the size of his donations, which were on average much larger than those of his peers. Perhaps the strongest claim that can be made is that there are only around 4 or 5 credible candidates in the dataset, and that everybody below this threshold for the most part performs poorly.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I saw that the most money came out of Charlotte. I also saw that a disproportionate amount of money came from a small number of super contributors. It blew my mind that 3% of donors made up 20% of total contributions.

What was the strongest relationship you found?

The relationship between median and mean donations. Jeb Bush was solidly in the lead on both.

Multivariate Plots Section

##  [1] "Bush, Jeb"                 "Carson, Benjamin S."      
##  [3] "Clinton, Hillary Rodham"   "Cruz, Rafael Edward 'Ted'"
##  [5] "Fiorina, Carly"            "Graham, Lindsey O."       
##  [7] "Huckabee, Mike"            "Jindal, Bobby"            
##  [9] "O'Malley, Martin Joseph"   "Paul, Rand"               
## [11] "Perry, James R. (Rick)"    "Rubio, Marco"             
## [13] "Sanders, Bernard"          "Santorum, Richard J."

I wanted to look at how contributions rose or fell for candidates over time. Immediately, however, I noticed a problem. There are 14 candidates. This is a large number for a legend, and the problem is compounded by the fact that the colors are effectively uninterpretable.

To mitigate this issue I decided to ignore outlier candidates.

## NULL
## [1] Bush, Jeb                 Carson, Benjamin S.      
## [3] Clinton, Hillary Rodham   Cruz, Rafael Edward 'Ted'
## [5] Paul, Rand                Rubio, Marco             
## [7] Sanders, Bernard         
## 7 Levels: Bush, Jeb Carson, Benjamin S. ... Sanders, Bernard

I was more or less subjective regarding which candidates to exclude, but I used the bivariate plots shown above as a reference for which candidates performed the poorest.

## Warning: Removed 20 rows containing missing values (geom_point).

This was slightly better but there was still a lot of clutter. There were a large number of dots all very close to each other and in many cases they overlapped. Using alpha helped some with this issue, but it still wasn’t ideal.

##       cmte_id   cand_id            contbr_nm   contbr_city contbr_st
## 173 C00579458 P60008059  STRUHS, SARA W. MS.     CHARLOTTE        NC
## 177 C00579458 P60008059 BURBAGE, MICHAEL MR.       RALEIGH        NC
## 178 C00579458 P60008059       CAMERON, SUSAN WINSTON SALEM        NC
## 186 C00579458 P60008059   GARNER, STEVEN MR.    WILMINGTON        NC
## 193 C00579458 P60008059    LINDSTROM, MARCIA      FRANKLIN        NC
## 195 C00579458 P60008059      MCNEEL, RICHARD   CHAPEL HILL        NC
##     contbr_zip    contbr_employer contbr_occupation contb_receipt_dt
## 173  282103248               NONE              NONE        15-Jun-15
## 177  276146201             LENOVO        WEB DESIGN        30-Jun-15
## 178      27106  REYNOLDS AMERICAN         EXECUTIVE        19-Jun-15
## 186  284054795 GARNER COMPANY LLC          INVESTOR        29-Jun-15
## 193  287340027            RETIRED           RETIRED        16-Jun-15
## 195  275178398            RETIRED           RETIRED        16-Jun-15
##     receipt_desc memo_cd memo_text form_tp file_num     tran_id
## 173                                  SA17A  1015075 SA17.110765
## 177                                  SA17A  1015075 SA17.118375
## 178                                  SA17A  1015075 SA17.113064
## 186                                  SA17A  1015075 SA17.117385
## 193                                  SA17A  1015075 SA17.112168
## 195                                  SA17A  1015075 SA17.112238
##     election_tp   cand_nm contb_receipt_amt Days contb_receipt_amt.Bucket
## 173       P2016 Bush, Jeb               250  286                (100,250]
## 177       P2016 Bush, Jeb              1350  301              (1000,2750]
## 178       P2016 Bush, Jeb              2700  290              (1000,2750]
## 186       P2016 Bush, Jeb              2000  300              (1000,2750]
## 193       P2016 Bush, Jeb               250  287                (100,250]
## 195       P2016 Bush, Jeb              1000  287               (500,1000]
##     Months
## 173      9
## 177     10
## 178      9
## 186     10
## 193      9
## 195      9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    35.0   100.0   340.9   250.0 10800.0
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 20 rows containing missing values (stat_summary).

## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 20 rows containing missing values (stat_summary).

So this was a much clearer picture. We can see that Rand Paul started campaigning early and succeeded in attracting a lot of large contributions. However, as time went on and more candidates entered the race, the mean and median amounts donated to him dropped off significantly and started to converge on the amounts donated to the other candidates. We can also see that Jeb Bush only started getting money in the last month or two, and again the amounts were significantly larger than the mean and median amounts donated to the other candidates.
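The monthly averages behind this kind of plot can be sketched as follows. The `Days` and amount values are the six Jeb Bush rows printed above; deriving `Months` by integer division by 30 is an assumption that happens to reproduce the `Months` column shown for those rows, since the original derivation is not shown.

```r
# The six rows printed above (Days and contb_receipt_amt for Jeb Bush).
toy <- data.frame(
  cand_nm           = "Bush, Jeb",
  Days              = c(286, 301, 290, 300, 287, 297),
  contb_receipt_amt = c(250, 1350, 2700, 2000, 250, 1000)
)

# Assumed derivation: 30-day months counted from the first receipt date.
toy$Months <- toy$Days %/% 30

# Mean contribution per candidate per month.
monthly <- aggregate(contb_receipt_amt ~ cand_nm + Months, data = toy, FUN = mean)
monthly
```

Swapping `FUN = mean` for `FUN = median` gives the corresponding median series.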

Next I wanted to take another look at contributions by city. Specifically, I wanted to see the breakdown of donations by candidate for these different locations. Interesting. It seems that in the cities that donate the most, the most money is going to Hillary Clinton and Ben Carson respectively, with Jeb Bush coming in third.
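A city-by-candidate breakdown of this kind can be tabulated with `xtabs`. The rows below are toy data with shortened, illustrative candidate labels, not figures from the real dataset.

```r
# Toy contributions across three cities and three candidates.
toy <- data.frame(
  contbr_city       = c("CHARLOTTE", "CHARLOTTE", "RALEIGH", "RALEIGH", "DURHAM"),
  cand_nm           = c("Clinton", "Carson", "Clinton", "Bush", "Carson"),
  contb_receipt_amt = c(2700, 100, 500, 2700, 50)
)

# Total dollars for every city/candidate pair (zero where there were none).
by_city <- xtabs(contb_receipt_amt ~ contbr_city + cand_nm, data = toy)

# Keep only the two cities with the largest overall totals.
city_totals <- sort(rowSums(by_city), decreasing = TRUE)
top_cities  <- names(city_totals)[1:2]
by_city[top_cities, ]
```

Restricting the table to the top cities keeps a stacked or faceted plot readable when there are hundreds of city levels.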

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Jeb Bush again won hands down when we started to look at mean and median contributions over time. We also saw that part of the reason for this was that he almost always receives mid-level donations. Another interesting point was that the same three candidates that received the most total contributions come up in the same order when looking at contributions by location. Again it was Hillary Clinton in 1st place, Ben Carson in 2nd, and Jeb Bush in 3rd.

I do not feel confident, based on this data, that I will be able to build an effective linear model for campaign donations. There are just too many outside variables to take into consideration (media, population, debates). Other than the location of contributors, I think that the factors driving the success in donations for these different candidates aren’t represented in this dataset.

Were there any interesting or surprising interactions between features?

I was surprised that Jeb Bush started receiving money so much later than the other candidates, and Rand Paul so early. It was also interesting to see that the mean donations for Rand Paul were initially much higher before other candidates started receiving funds. There was a significant drop-off after that point. Finally, it was interesting to see how much location plays a role in fundraising. It does not seem like a coincidence that the same three candidates leading in total contributions are also the three candidates leading in amounts contributed by the big-money cities (and in the same order as well!).

Final Plots and Summary

Plot One

## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 20 rows containing missing values (stat_summary).
## Warning: Removed 5 rows containing missing values (geom_point).

Description One

Jeb Bush started raising funds last, but was able to jump to third place in terms of total funds raised. Clinton has been in the lead since she started raising cash, and Rand Paul, despite getting contributions 6 months before anybody else, has fallen to the middle of the pack. Additionally, we can see that his mean contribution size has fallen steadily ever since the other candidates started getting money. This could be due to him getting a large number of smaller donations now that it’s getting later into the race.

Plot Two

## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 7 rows containing non-finite values (stat_density).
## Warning: Removed 4 rows containing non-finite values (stat_density).
## Warning: Removed 7 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).

Description Two

Again we can see that while Jeb Bush is the weakest of the ‘frontrunners’ in terms of the number of his donations, the contributions he does have are quite large. This differentiates him from the remaining candidates, as the distributions of their contributions are centered further to the left and more or less follow the same pattern.

Plot Three

Description Three

Ben Carson has the most normal-looking distribution of contributions, using log10 and buckets on the x axis. Jeb Bush’s distribution is centered further to the right, in the $500-$1000 range, with a number of larger and smaller donations above and below this. The total frequency of his contributions seems to be lower. Bernie Sanders actually seems to mimic Ben Carson a lot in terms of the frequency and scale of his contributions. However, it’s easy to see why Hillary Clinton has been so successful at fundraising, as she is represented strongly across all of the buckets.

Reflection

The Contributions data set contains information on 14 presidential candidates and all of their registered contributors in the state of North Carolina. I started by understanding the individual variables and how their levels were structured. Then I processed the dataset to fix an issue with duplicated values. After analyzing the spread of values, I realized that it made the most sense to analyze contributions in ‘buckets’ with different minimum and maximum thresholds. I also saw that different candidates had different patterns in their contributions. Jeb Bush stood out as the biggest outlier in this regard: despite receiving contributions the latest, and receiving the fewest contributions of all the frontrunner candidates, the mean and median contributions he did receive were far larger than those of his peers. Hillary Clinton was the strongest candidate in terms of total contributions, though Ben Carson, a political outsider in many ways, performed strongly. I was unable to create a linear model to predict the total contributions for each candidate, as the data was too variable and unrepresentative of many key drivers in campaigning. Maybe it would make more sense later in the campaign, but right now I think it is too early. One thing that would make this task easier is access to the list of major super PACs working on behalf of these different candidates. I know from experience that the amounts represented in the Federal Election Commission datasets are just a small fraction of the total amounts contributed to the candidates during political campaigns.